Road accidents in Australia carry enormous social and economic impacts, with an estimated cost of almost $30 billion each year. The aim of this analysis was to explore road deaths in Australia to identify trends in the data, and groups that may be more at risk of dying in road accidents. Data on the number of road deaths in Australia since 1989 were obtained from the Australian Road Deaths Database, supplemented with data from the Australian motor vehicle census of 2018, and population estimates for each age group and state in Australia. The road deaths dataset was explored for trends visually, and in order to identify potentially over-represented groups, the proportion of certain groups who died in road accidents was compared with the proportion of that group in the Australian population. These calculations were done for young men, Northern Territorians, motorcycle users, and heavy vehicles. For example, the proportion of road deaths who were young men was compared with the proportion of young men in the general population of Australia. Results indicated that young men, Northern Territorians, and motorcycle users are over-represented in road deaths, and that around midnight on Friday and Saturday night is a particularly dangerous time for young men to be on the road. These findings could be used to inform public policy, for example, by targeting young male drivers in road safety campaigns aimed at reducing the road toll.
Road accidents in Australia carry enormous social and economic impacts, with an estimated cost of $29.7 billion each year (Australian Automobile Association, 2017). A concerning trend was noticed from 2014 to 2017, with the road toll beginning to rise after decades of decline (Australian Automobile Association, 2017). The Australian Government has recently committed an additional $2.2 billion to road safety and established the Office of Road Safety to help address this issue (Prime Minister of Australia, 2019).
The Australian Road Deaths Database (ARDD) contains two datasets that are updated each month – the fatalities, and the fatal crashes datasets (Bureau of Infrastructure Transport and Regional Economics, 2019a). The aim of this study was to explore the fatalities dataset to identify trends in road deaths, and groups who may be more at risk of dying on Australian roads.
The ARDD fatalities dataset is published each month by the Bureau of Infrastructure Transport and Regional Economics. The April 2019 dataset contains records of 50,160 deaths that have occurred on Australian roads since January 1989 (Bureau of Infrastructure Transport and Regional Economics, 2019b). Data are collected each month from police reports to the state and territory road safety agencies. Each death in the dataset has associated information including a variable identifying the crash, the state where it occurred, temporal variables (year, month, day of the week, time of day), crash type (e.g., single, multiple, pedestrian), whether a heavy vehicle was involved (bus, heavy rigid, or articulated truck), the speed limit, the type of road user killed (e.g., driver, passenger, pedestrian), gender, age, the type of road (e.g., local road or highway), variables relating to the location (including the remoteness area, e.g., major city, outer regional), local government area, and whether the accident happened during the Easter or Christmas holiday periods.
Data from other sources were also used to supplement the analysis in this report. State population numbers from 1989 to 2016 were obtained from the Australian Bureau of Statistics (2016), and State population projections for 2017 and 2018 from the Australian Bureau of Statistics (2018c). Population statistics by age and gender for 2018 were sourced from the Australian Bureau of Statistics (2018a). Data estimating the total number of various registered vehicle types in Australia in 2018 (e.g., total numbers of cars, buses and motorcycles) were obtained from the Australian Bureau of Statistics (2018b).
The analysis was done using R (R Core Team, 2017), supplemented by the dplyr (Wickham, François, Henry, & Müller, 2019), ggplot2 (Wickham, 2009), mice (van Buuren & Groothuis-Oudshoorn, 2011) and lubridate (Grolemund & Wickham, 2011) packages.
The ARDD fatalities dataset was downloaded from the Australian Government Open Data website (Australian Govenment, No date). Missing data were coded as “-9” or “Other/-9” in the dataset. I also set Gender of “unspecified” as missing, since there was only one observation with this value.
This report analyses the data from January 1989 to December 2018 using the variables: state, month, year, day of the week, time of day, road user, gender, age, and bus, heavy rigid, and articulated truck involvement. State, day of the week, gender, road user, and bus, heavy rigid, and articulated truck involvement were coded as unordered factors. Time of day was converted to a numeric variable indicating the hour of the day (in 24-hour time). Age, year and month were coded as numeric integers. This dataset had 11 variables providing information on 49,730 motor vehicle deaths from January 1989 to December 2018.
State, year, month, day of the week, bus, and articulated truck involvement had no missing values. Time of day had 40 missing values, road user 77, gender 22, and age 84 missing values. In total, there were 218 observations with missing data for at least one of these variables (<1% of the dataset). Single imputation was used to assign missing values for these variables using the mice function from the mice library. Age and hour of the day were imputed using predictive mean matching, gender with logistic regression imputation, and road user with polytomous regression imputation. All variables apart from the heavy vehicle involvement variables were used to predict the missing values. Heavy rigid truck involvement had 20,493 missing values, but this analysis only used data from 2018 for the heavy vehicle involvement variables, of which none were missing, so missing values did not need to be imputed for this variable. I refer to the subsetted, imputed dataset as the “road deaths dataset” in this report.
library(tidyverse)
library(ggthemes)
library(lubridate)
library(cowplot)
library(mice)
library(readxl)
library(scales)
library(gganimate)
# Import road deaths data----
path = 'https://data.gov.au/data/dataset/5b530fb8-526e-4fbf-b0f6-aa24e84e4277/resource/fd646fdc-7788-4bea-a736-e4aeb0dd09a8/download/bitre_ardd_fatalities_apr_2019.csv'
road_data = read.csv(path, header = T, na.strings = c("", -9, 'Other/-9'))
# filter out 2019 data - not a complete year
road_data = filter(road_data, Year != 2019)
# Don't need Crash ID
road_data = road_data %>% select(-Crash.ID)
road_data$State = factor(road_data$State, levels=c('NSW', 'Vic', 'Qld', 'WA', 'SA', 'Tas', 'ACT', 'NT'))
road_data$Month = as.integer(road_data$Month)
road_data$Dayweek = factor(road_data$Dayweek, levels = c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'))
# extract hour of day from Time column
road_data$Bus.Involvement = as.factor(road_data$Bus.Involvement)
road_data$Heavy.Rigid.Truck.Involvement = as.factor(road_data$Heavy.Rigid.Truck.Involvement)
road_data$Articulated.Truck.Involvement = as.factor(road_data$Articulated.Truck.Involvement)
road_data$Gender[road_data$Gender == 'Unspecified'] = NA
road_data$Gender = droplevels(road_data$Gender)
road_data$Gender = as.factor(road_data$Gender)
road_data$Road.User = as.factor(road_data$Road.User)
truck_2018 = road_data %>%
filter(Year==2018) %>%
summarise(count = sum(is.na(Heavy.Rigid.Truck.Involvement)))
# how many observations had missing time, road_user, gender, age
missing_data = road_data %>%
filter(is.na(Gender) | is.na(Time) | is.na(Road.User) | is.na(Age)) %>%
select(State:Time, Bus.Involvement:Articulated.Truck.Involvement, Road.User:Age)
# now deal with time
time_of_day <- hm(road_data$Time, quiet = T)
road_data$time_of_day <- hour(time_of_day)
# select varibles that I'm using and impute missing----
road_data = road_data %>%
select(c(State:Dayweek, Bus.Involvement:Articulated.Truck.Involvement, Road.User:Age, time_of_day))
road_data = rename(road_data, bus_involve=Bus.Involvement, heavy_rigid_involve=Heavy.Rigid.Truck.Involvement,
articulated_involve=Articulated.Truck.Involvement, road_user=Road.User)
init = mice(road_data, maxit=0) # need to impute road user (77), gender (22), age (84), time (40)
meth = init$method
predM = init$predictorMatrix
# don't use bus involve, heavy rigid involve, or articulated involve pas predictor variables
predM[, c('bus_involve')]=0
predM[, c('heavy_rigid_involve')]=0
predM[, c('articulated_involve')]=0
# don't need to impute heavy rigid
meth[c('heavy_rigid_involve')]=""
# set methods for imputation
meth[c('road_user')]="polyreg"
meth[c('Gender')]="logreg"
meth[c('Age')]="pmm"
meth[c('time_of_day')]="pmm"
imputed = mice(road_data, method=meth, predictorMatrix=predM, m=1, printFlag = F)
road_data <- complete(imputed)
# create new variables age group and 5-year interval----
# make age discrete
# 0-16, 17-40, 41-60, 60+
road_data$Gender = factor(road_data$Gender, levels = c('Male', 'Female'))
road_data = road_data %>%
mutate(age_cat = cut(Age, breaks = c(-Inf, 16, 40, 60, Inf), labels = c('0-16', '17-40', '41-60', '>60')))
# create 5-year intervals as factor
five_years = c('1989 to 1993', '1994 to 1998', '1999 to 2003', '2004 to 2008', '2009 to 2013', '2014 to 2018')
road_data = road_data %>%
mutate(five_yr_interval = cut(Year, breaks = c(-Inf, 1993, 1998, 2003, 2008, 2013, Inf),
labels=five_years))
# Import state population data----
# from here: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3105.0.65.0012016?OpenDocument
state_pop = read_excel('data/historical_state_pop.xls', sheet=4, skip = 4)
# the order is nsw, vic, qld, sa, wa, tas, nt, act
names(state_pop)[1] = 'States'
state_pop = state_pop %>%
filter(str_detect(States, pattern="Estimated"))
States = c('NSW', 'Vic', 'Qld', 'SA', 'WA', 'Tas', 'NT', 'ACT', 'Total')
state_pop$States = States
# only need from 1989
state_pop = select(state_pop, c(1, 20:47))
# need to get pop projections for 2017, 2018
# from pop projections https://www.abs.gov.au/AUSSTATS/abs@.nsf/Lookup/3222.0Main+Features12017%20(base)%20-%202066?OpenDocument
# used Series B projections, which is based on current birth rate
nsw = c(7867052, 8001204)
vic = c(6320292, 6465890)
qld = c(4928412, 5013437)
sa = c(1723714, 1735172)
wa = c(2574762, 2598092)
tas = c(522279, 526930)
nt = c(247659, 249796)
act = c(411986, 420667)
total = c(24600777, 25015825)
state_pop_17_18 = rbind.data.frame(nsw, vic, qld, sa, wa, tas, nt, act, total)
names(state_pop_17_18) = c('2017', '2018')
state_pop = cbind(state_pop, state_pop_17_18)
# Import motorvehicle census data
# from 2018 mv census https://www.abs.gov.au/AUSSTATS/abs@.nsf/Lookup/9309.0Main+Features131%20Jan%202018?OpenDocument
mv_names = c('Passenger', 'CamperVan', 'LightCommercial', 'LightRigid', 'HeavyRigid', 'Articulated', 'Non-Freight', 'Bus', 'Motorcycle', 'Total')
mv_census = c(14330432, 66594, 3187131, 158032, 346966, 100694, 24165, 98565, 860700, 19173279)
# levels of road user are: Car/Truck, Cyclist, Motorcycle, Pedestrian - so compine into car/truck/bus vs MC
# are MC more at risk - can then normalize - remove cyclist, pedestrian etc from User
names(mv_census) = mv_names
mc = mv_census[9]
total = mv_census[10]
mc_prop = mc/total # MC account for 5% of vehicles on the road
bus = mv_census[8]
bus_prop = bus/total
heavy_rigid = mv_census[5]
heavy_rigid_prop = heavy_rigid/total
articulated = mv_census[6]
articulated_prop = articulated/total
prop_pop = rbind(mc_prop, bus_prop, heavy_rigid_prop, articulated_prop)
# Exploratory distribution plots
# plots showing the variation in Age, Gender, State, Year, Month, Dayweek, time_of_day, road_user
theme_set(theme_minimal())
age_dist = ggplot(road_data, aes(Age)) +
geom_histogram(binwidth = 5) +
ylab('') +
scale_y_continuous(labels = comma) +
xlab('Age (years)')
sex_p = ggplot(road_data, aes(Gender)) +
geom_bar() +
scale_y_continuous(labels = comma) +
ylab('')
state_p = ggplot(road_data, aes(State)) +
geom_bar() + # interesting NT has more than act or tas even though lowest population (ordered by population)
scale_y_continuous(labels = comma) +
ylab('')
year_p = ggplot(road_data, aes(Year)) +
geom_histogram(binwidth = 1) +
scale_x_continuous(breaks = c(1990, 1995, 2000, 2005, 2010, 2015), labels = c(1990, 1995, 2000, 2005, 2010, 2015)) +
scale_y_continuous(labels = comma) +
ylab('')
month_p = ggplot(road_data, aes(as.factor(Month))) +
geom_bar() +
ylab('') +
xlab('Month') +
scale_y_continuous(labels = comma) +
scale_x_discrete(labels = month.abb)
day_p = ggplot(road_data, aes(Dayweek)) +
geom_bar() +
scale_x_discrete(labels=c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")) +
scale_y_continuous(labels = comma) +
ylab("") +
xlab("Day of the Week")
time_p = ggplot(road_data, aes(time_of_day)) +
geom_histogram(binwidth = 1) +
ylab("") +
scale_x_continuous(breaks = c(0, 6, 12, 18)) +
scale_y_continuous(labels = comma) +
xlab("Hour of the Day (24 hour)")
road_data$road_user = fct_infreq(road_data$road_user)
user_p = ggplot(road_data, aes(road_user)) +
geom_bar() +
ylab("") +
xlab("Type of Road User") +
scale_y_continuous(labels = comma) +
scale_x_discrete(labels=c("Driver", "Passenger", "Pedestrian", "MC rider", "Cyclist", "MC passenger")) +
coord_flip()
plot_grid(year_p, month_p, day_p, time_p, age_dist, sex_p, state_p, user_p, ncol=3, labels=c(LETTERS[1:8]))
Figure 1. Density plots for variables in the road deaths dataset, except for the heavy vehicle variables. The y-axis for all plots shows the count, except for plot H, for which count is the x-axis. A bin width of 1 was chosen for year and hour of the day, and 5 for age. The x-axis for state is ordered from the largest to smallest population. MC = motorcycle.
Figure 1 shows the distribution of the variables year, month, day of the week, hour of the day, age, gender, state, and type of road user in the road deaths dataset. The plot for year shows that the number of deaths has significantly decreased since 1989. In recent years, the number of deaths increased from 2014 to 2016, but then decreased again in 2017 and 2018, with numbers for 2018 similar to 2014. Friday, Saturday and Sunday have more deaths than other days of the week, and the afternoon appears to be the most dangerous time to be on the road, with a peak in the distribution at around 3 pm, and a secondary peak at around midnight. Young people and males seem to be over-represented – there is a large peak in the distribution for age at around 20 years, and around 70% of deaths have been males. The plot for state generally shows that the number of deaths decreases with state population, apart from the Northern Territory (NT), which has the smallest population, but more deaths than both the Australian Capital Territory (ACT) and Tasmania. Buses were involved in 941 deaths, and articulated trucks in 5,071 deaths. Heavy rigid trucks were involved in 1,400 deaths, although, as mentioned, there were many missing values for this variable (>20,000).
Age and year were used to create two new discrete variables: “five-year interval” was created by dividing year into five-year segments, and “age group”, was created by splitting age into four groups, 0–16, 17–40, 41–60 and >60 years. These categories were chosen to reflect stages of life: child, young adult of driving age, middle-aged, and older adult.
State population data from 1989 to 2016 were obtained from the Australian Bureau of Statistics (2016), and population projections for 2017 to 2018 from the Australian Bureau of Statistics (2018c). These data were joined together and are referred to as the “state population dataset” in this report. The number of males in Australia aged 20 to 39 in 2018 and the total population of Australia for 2018 were obtained from the Australian Bureau of Statistics (2018a), and I refer to this as the “2018 population data”. The motor vehicle census for 2018 was obtained from the Australian Bureau of Statistics (2018b). Data from Table 1, 2018 was used, which reports the number of various different types of vehicles for the 19.2 million registered motor vehicles in Australia. I refer to this dataset as the “motor vehicle dataset”.
The exploratory plots in Figure 1 showed some interesting trends in road deaths. In particular, young people (Fig. 1E) and males (Fig. 1F) seem to be over-represented. This was further explored by plotting the number of deaths each year, after grouping by age group and gender. Since young men seemed to be over-represented, I calculated the proportion of the road deaths dataset in 2018 who were males aged 20 to 39, and compared this with the proportion of young men of that age in all of Australia using the 2018 population data.
Another trend in Figure 1 (C & D) was that weekends and afternoons seem to be more dangerous times to drive. To further explore this, I grouped the road deaths data grouped by five-year interval, hour of the day, day of the week, age group, and gender, counted the number of deaths for each group, and plotted these data.
NT seemed to be over-represented in the exploratory plots, since it had more road deaths than two states with higher populations (Figure 1G). To further explore the relationship between road deaths and state, I created a dataset that counted the number of deaths in each state every year. This was joined to the state population dataset and the combined dataset was used to calculate the number of deaths per 100,000 people for each state, by year, then plotted.
I was also interested in whether motorcycles or heavy vehicles were over-represented in this dataset. To investigate this, I calculated the proportion of deaths in 2018 with motorcycle, bus, heavy rigid, and articulated truck involvement in the road deaths dataset. The proportion of deaths with motorcycle involvement was calculated by counting the number of motorcycle rider and passenger deaths in 2018, then dividing by the total number of deaths for that year. The proportion of deaths for each of the heavy vehicle involvement variables was calculated, and the proportion of deaths for these vehicle types in the road deaths dataset was compared with the overall proportion of that type of vehicle from the Australian motor vehicle census of 2018, and plotted.
The number of deaths each year for females and males in each age group is depicted in Figure 2. There has been a general decrease in the number of deaths over time, especially for young men, and also females and children. The number of deaths for middle-aged men has remained fairly constant overtime, however, and although the numbers for older men decreased from around 1995, they have been increasing since 2010. The most striking feature of this plot is the number of young men who have died on Australian roads compared to other groups. To quantify the extent to which this group is over-represented, I calculated the proportion of deaths in 2018 who were males aged 20 to 39 in the road deaths dataset, and compared this with the proportion of males that age in the general population of Australia in 2018. Young men aged 20 to 39 accounted for 28% of deaths, but only 14% of the population of Australia in 2018 – twice as many as would be expected given the proportion in the general population.
# Age, gender plot----
ggplot(road_data, aes(x=Year, colour=age_cat, linetype=Gender)) +
geom_line(stat = 'count') +
scale_colour_colorblind() +
scale_x_continuous(breaks = c(1990, 1995, 2000, 2005, 2010, 2015), labels = c('1990', '1995', '2000', '2005', '2010', '2015')) +
labs(colour='Age Group (years)') +
ylab('Number of Deaths') +
theme_minimal()
Figure 2. The number of deaths each year for each age group and gender.
# Proportion of young males in Australia----
# proportion of population each age in 2018: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3101.0Jun%202018?OpenDocument
age_2018 = read_excel('data/pop_2018.xls', sheet = 7, skip = 4)
males_2018 = age_2018[2:23, c(1,10)]
females_2018 = age_2018[25:46, c(1,10)]
total_2018 = age_2018[48:69, c(1,10)]
colnames(males_2018) = c("age_group", "Males")
colnames(females_2018) = c("age_group", "Females")
colnames(total_2018) = c("age_group", "Total")
age_2018 = cbind.data.frame(males_2018, females_2018, total_2018)
age_2018 = age_2018[,c(1,2,4,6)]
age_2018 = age_2018 %>%
mutate(prop_male = Males/Total,
prop_female = Females/Total)
total = age_2018[22,4]
age_2018 = age_2018 %>%
mutate(prop_total = Total/total)
# so 20 to 39
prop_male_20_to_39 = age_2018 %>%
filter(age_group %in% c("20–24", "25–29", "30–34", "35–39")) %>%
mutate(prop_male_total = prop_total*prop_male) %>%
summarise(prop_male = sum(prop_male_total))
# so 14% of population aged 20 to 39 in 2018 were males
# but accounted for what proportion of deaths?
twenty_to_39_deaths = road_data %>%
filter(Year==2018) %>%
group_by(Age, Gender) %>%
tally()
total_deaths_2018 = twenty_to_39_deaths %>%
summarise(total=sum(n)) %>%
summarise(total=sum(total)) # 1141 deaths in 2018
age_names = c('0 to 19', '20 to 39', '>=40' )
age_deaths_2018 = twenty_to_39_deaths %>%
mutate(Age_20_39 = cut(Age, breaks = c(-Inf, 19, 39, Inf), labels = age_names)) %>%
group_by(Age_20_39, Gender) %>%
tally()
twenty_to_39_male_deaths_2018 = filter(age_deaths_2018, Age_20_39=='20 to 39', Gender=='Male')
# 319 deaths male
Prop_male_deaths_2018 = twenty_to_39_male_deaths_2018$n/total_deaths_2018$total
# Males in this age group account for 28% of deaths, but only 14% of population
The exploratory plots in Figure 1 (C & D) indicate that afternoons and weekends (Friday to Sunday) are the most dangerous times to drive. To further explore these relationships, I plotted the number of deaths for each five-year interval by age group and hour of the day faceted by day of the week and gender. The results are depicted in Figure 3. For the middle-aged and older age groups (both men and women), more deaths occur in the afternoon, but for young men there is a sharp spike in the number of deaths at around midnight on Friday and Saturday nights. There is also a similar trend for young women, although it is less pronounced. The numbers of deaths have clearly reduced for young men, but the trends have remained fairly stable over time for all age groups and genders. For example, the number of deaths for young men at around midnight on Saturday has substantially decreased over time, but there remains a spike in the number of deaths at this time for the most recent five-year interval (2014 to 2018).
# Time of day
road_data$hour_int = as.integer(road_data$time_of_day)
# 12 pm to 11am for plotting
road_data$hour_int = factor(road_data$hour_int, levels = c(0:23))
# animated by 5-year interval
year_hour_day_p = road_data %>%
group_by(five_yr_interval, hour_int, Dayweek, age_cat, Gender) %>%
summarise(count=n()) %>%
ggplot(aes(x=hour_int, y=count, group=age_cat, colour=age_cat, frame = as.character(five_yr_interval))) +
geom_line(alpha=0.5) +
scale_x_discrete(breaks = c(0,6,12,18), labels = c('0', '6', '12', '18')) +
xlab('Time (24 hour)') +
geom_smooth(se=F, size=0.5, method = 'loess') +
labs(colour='Age (years)') +
ylab('Number of Deaths') +
scale_color_colorblind() +
facet_grid(rows=vars(Gender), cols = vars(Dayweek)) +
theme_minimal() +
theme(panel.spacing.x=unit(0, "lines"),panel.spacing.y=unit(1, "lines")) +
ggtitle("Year: {closest_state}") +
transition_states(as.character(five_yr_interval),
transition_length = 2,
state_length = 1)
animate(year_hour_day_p, width = 900, height = 450)
Figure 3. The number of deaths each hour of the day by age group for females and males each day of the week for each five-year interval. The squiggly lines show the raw data, and loess smoothing lines were added to help to show trends in the data.
Another interesting trend in the exploratory plots in Figure 1 was that NT seems over-represented in number of deaths. It has the smallest population of the states, but has had more deaths than both ACT and Tasmania. To further explore whether NT is over-represented, I plotted the number of deaths per 100,000 people for each state by year for females and males in each age group. The results are presented in Figure 4, and indicate that NT is the most dangerous state to drive in for all age groups of both genders apart from older adults (>60 years).
# Deaths by State per 100,000----
deaths_by_year_state = road_data %>%
mutate(age_cat = fct_recode(age_cat,
'0-16 years' = '0-16',
'17-40 years' = '17-40',
'41-60 years' = '41-60',
'>60 years' = '>60')) %>%
group_by(State, Year, age_cat, Gender) %>%
tally(name = 'Deaths')
total_deaths = road_data %>%
group_by(Year, age_cat, Gender) %>%
tally(name = 'Deaths')
total_deaths$State = 'Total'
deaths_by_year_state = rbind.data.frame(deaths_by_year_state, total_deaths)
deaths_by_year_state$State = factor(deaths_by_year_state$State, levels = c('NSW', 'Vic', 'Qld', 'WA', 'SA', 'Tas', 'ACT', 'NT', 'Total'))
state_pop_long = state_pop %>%
gather(key='Year', value='Population', 2:31)
names(state_pop_long)[1] = 'State'
state_pop_long$Year = as.numeric(state_pop_long$Year)
state_pop_long$State = factor(state_pop_long$State, levels = c('NSW', 'Vic', 'Qld', 'WA', 'SA', 'Tas', 'ACT', 'NT', 'Total'))
# join deaths with state population info (so can normalize)
death_pop = full_join(deaths_by_year_state, state_pop_long, by=c('State', 'Year'))
# remove total for these graphs
death_pop_states = death_pop %>%
filter(State !='Total')
death_pop_states$State = fct_drop(death_pop_states$State)
# Normalize for state population
death_pop_states = death_pop_states %>%
mutate(death_proportion = Deaths / Population) %>%
mutate(death_prop_per_100k = death_proportion * 100000)
ggplot(death_pop_states, aes(x=Year, y=death_prop_per_100k, colour=State)) +
geom_line(alpha=0.5) +
geom_smooth(method = 'loess', se=F, size=0.5) +
ylab('Number of Deaths per 100,000') +
scale_x_continuous(breaks = c(1990, 1995, 2000, 2005, 2010, 2015), labels = c('1990', '1995', '2000', '2005', '2010', '2015')) +
scale_colour_colorblind() +
scale_y_log10() +
facet_grid(age_cat~Gender) +
theme_minimal()
Figure 4. The number of road deaths per 100,000 people each year for each state by gender and age group. Note the log (base 10) scale of the y-axis. The squiggly lines show the raw data, and loess smoothing lines were added to help to show trends in the data.
I was also interested in exploring whether certain types of vehicles were over-represented in the road deaths data, in particular, motorcycles and heavy vehicles. To investigate this, I calculated the proportion of deaths in 2018 involving motorcycles (rider or passenger), and for each of the heavy vehicles, and compared these to the proportion of these vehicle types reported in the motor vehicle census of 2018. The results are presented in Figure 5, and indicate that all of these vehicle types are over-represented. However, the heavy vehicles (buses, and heavy trucks) may spend more time on the road compared with the average vehicle. For example, a bus might drive for 8 hours a day, whereas most people may only drive their car to and from work each day. However, there is no reason to suspect that motorcycles spend disproportionately more time on the road compared to typical vehicles. According to the motor vehicle census of 2018, approximately 5% of registered vehicles on Australian roads were motorcycles, but motorcycle riders and passengers accounted for 17% of road deaths – over three times as many as would be expected given the proportion of motorcycles on the road. I hypothesised that a many of the motorcycle deaths may be young men, so calculated the percentage for this group – 65% of motorcycle deaths in the road deaths dataset were males aged 17–40 years.
# road user----
road_data = road_data %>%
mutate(User = ifelse(road_user=='Motorcycle pillion passenger'|road_user=='Motorcycle rider', 'Motorcycle',
ifelse(road_user=='Driver'|road_user=='Passenger', 'Car/Truck',
ifelse(road_user=='Pedestrian', 'Pedestrian',
ifelse(road_user=='Pedal cyclist', 'Cyclist', NA)))))
road_data$User = factor(road_data$User, levels = c('Car/Truck', 'Motorcycle', 'Pedestrian', 'Cyclist'))
# what proportion of deaths in 2018 were MC?
# remove pedestrian and cyclist - not included in MC census
users_2018 = road_data %>%
filter(Year==2018) %>%
group_by(User) %>%
tally()
mc_prop_2018 = users_2018$n[2]/sum(users_2018$n)
# Only 5% of vehicles on the road are mc, but they account for 17% of deaths
# proportion of trucks involved in crash compared with census 2018
# bus, heavy rigid, articulated trucks
# are any of these overrepresented?
df_2018 = road_data %>%
filter(Year==2018)
bus_deaths_2018 = summary(as.factor(df_2018$bus_involve))[2]
total = nrow(df_2018)
bus_prop_2018 = bus_deaths_2018/total # 0.02 of deaths, but 0.005 of vehicles on road # not heavily overrepresented (4 times as many) - and driving a lot of the time
heavy_rigid_deaths_2018 = summary(as.factor(df_2018$heavy_rigid_involve))[2]
heavy_rigid_prop_2018 = heavy_rigid_deaths_2018 / total # 0.06 of deaths (6%) and account for 0.018 (~2%) of vehicles (overrepresented 3 times as many)
articulated_deaths_2018 = summary(as.factor(df_2018$articulated_involve))[2]
articulated_prop_2018 = articulated_deaths_2018 / total # 0.079 of deaths (~8%) and account for 0.005 of vehicles (heavily overrepresented 16 times as many)
deaths_2018 = c(mc_prop_2018, bus_prop_2018, heavy_rigid_prop_2018, articulated_prop_2018)
props_2018 = cbind.data.frame(prop_pop, deaths_2018)
names(props_2018) = c('Population', 'Deaths')
props_2018$user = c('Motorcycle', 'Bus', 'Heavy Rigid Truck', 'Articulated Truck')
# need to make long
props_2018 %>%
gather('Group', 'Proportion', 1:2) %>%
mutate(Group = factor(Group, levels = c('Population', 'Deaths'))) %>%
ggplot(aes(x=factor(user, levels=c('Bus', 'Heavy Rigid Truck', 'Articulated Truck', 'Motorcycle')), y=Proportion, colour=Group, fill=Group)) +
geom_bar(stat='identity', position = 'dodge') +
scale_color_manual(values=c("#E69F00", "#000000")) +
scale_fill_manual(values=c("#E69F00", "#000000")) +
scale_x_discrete(labels=c('Bus', 'Heavy Rigid', 'Articulated', 'Motorcycle')) +
xlab('Type of Vehicle')
Figure 5. The proportion of deaths in 2018 with motorcycle and heavy vehicle involvement in the road deaths dataset (Deaths), compared to the proportion of that type of vehicle reported in the motor vehicle census of 2018 (Population).
# what proportion died in MC accidents are young men?
mc_sex_age = road_data %>%
filter(User == 'Motorcycle')
young_men_mc = mc_sex_age %>%
filter(Gender == "Male", age_cat == '17-40')
prop_young_men_mc = dim(young_men_mc)[1] / dim(mc_sex_age)[1]
The aim of this study was to identify trends and potential groups who may be at risk of dying on Australian roads. Results indicate that young men, Northern Territorians, and motorcycle users are over-represented in deaths on Australian roads, and around midnight on Friday and Saturday night is a particularly dangerous time to drive for young men.
A limitation of this study is that it was exploratory in nature, containing no statistical analysis. Also, due to time limitations, I was not able to analyse all of the variables in the road deaths dataset, notably the location variables, which may contain further insights into road deaths in Australia. For example, it would be interesting to investigate whether there are certain regions of NT that are particularly dangerous. Although it appeared that heavy vehicles were over-represented when compared to the census data, this was not a realistic comparison, since heavy vehicles probably spend more time on the road compared to a typical vehicle. A more sophisticated analysis using further supplemental data, perhaps with weightings for the various heavy vehicle types, may be able to further investigate whether heavy vehicles are over-represented in road deaths.
The results of this analysis could be used to inform public policy, for example, by targeting young male drivers in campaigns aimed at improving road safety in Australia.
Australian Automobile Association (2017). Cost of Road Trauma in Australia. Retrieved from https://www.aaa.asn.au/wp-content/uploads/2018/03/AAA-ECON_Cost-of-road-trauma-summary-report_Sep-2017.pdf
Australian Bureau of Statistics (2016). Australian Historical Population Statistics, 2016: Population Size and Growth. Retrieved from: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3105.0.65.0012016?OpenDocument
Australian Bureau of Statistics (2018a). Australian Demographic Statistics, Jun 2018: Australian Demographic Statistics Tables. Retrieved from: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/3101.0Jun%202018?OpenDocument
Australian Bureau of Statistics (2018b). Motor Vehicle Census, Australia, 31 Jan 2018. Retrieved from: https://www.abs.gov.au/AUSSTATS/abs@.nsf/DetailsPage/9309.031%20Jan%202018?OpenDocument
Australian Bureau of Statistics (2018c). Population Projections, Australia, 2017: Projected population, components of change and summary statistics. Retrieved from: https://www.abs.gov.au/AUSSTATS/abs@.nsf/Lookup/3222.0Main+Features12017%20(base)%20-%202066?OpenDocument
Australian Govenment (No date). Australian Government: data.gov.au. Retrieved from https://data.gov.au/data/dataset/5b530fb8-526e-4fbf-b0f6-aa24e84e4277/resource/fd646fdc-7788-4bea-a736-e4aeb0dd09a8/download/bitre_ardd_fatalities_apr_2019.csv
Bureau of Infrastructure Transport and Regional Economics (2019a). Australian Road Deaths Database. Retrieved from https://www.bitre.gov.au/statistics/safety/fatal_road_crash_database.aspx
Bureau of Infrastructure Transport and Regional Economics (2019b). Australian Road Deaths Database: Fatalities. Retrieved from: https://data.gov.au/data/dataset/5b530fb8-526e-4fbf-b0f6-aa24e84e4277/resource/fd646fdc-7788-4bea-a736-e4aeb0dd09a8/download/bitre_ardd_fatalities_apr_2019.csv
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1-25.
Prime Minister of Australia (2019). $2.2 Billion Boost to Road Safety [Press release]. Retrieved from https://www.pm.gov.au/media/22-billion-boost-road-safety
PSMA Australia (2015). Transport & Topography: Data Product Desription. Retrieved from: https://www.psma.com.au/sites/default/files/Transport-and-Topography-Product-Description1.pdf
R Core Team (2017). R: A Language and Environment for Statistical Computing [Computer software]. (Version 3.6.0). Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.r-project.org/
van Buuren, S., & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3), 1-67.
Wickham, H. (2009). ggplot2: Elegant graphics for data analysis. New York: Springer-Verlag.
Wickham, H., François, R., Henry, L., & Müller, K. (2019). dplyr: A grammar of data manipulation (Version R package version 0.8.0.1). Retrieved from https://cran.r-project.org/package=dplyr
A work by Josh Myers
myers.josh@gmail.com